✅ DCO Check Passed. Thanks @ceberam, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers
- Align new union naming to differentiate from "legacy" prov (e.g. call it `SourceType`?), also aligned with new field
- Unroll the `tags` field into the individual fields we consider relevant for now, e.g. "voice" / "id" (I don't think we need "classes" now, as it's quite VTT-specific). The fact that we would be losing some information on VTT is actually consistent with other import/export paths (e.g. HTML with embedded CSS).
- Validate upon assignment could be considered outside this PR
- `DoclingDocument.add_text()` can be extended with a new optional param for `source`
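To make the review suggestions above concrete, here is a minimal sketch of what "unrolled" source fields and an optional `source` parameter could look like. It uses plain dataclasses and is not the PR's implementation: `TrackSource`, `TextEntry`, and this standalone `add_text` helper are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrackSource:
    """Hypothetical stand-in for a media-track source record (not the PR's class)."""

    start_time: float  # cue start, in seconds
    end_time: float  # cue end, in seconds
    identifier: Optional[str] = None  # cue id, if the cue has one
    voice: Optional[str] = None  # speaker name, if the cue has one

    def __post_init__(self) -> None:
        # Validate once at construction time instead of on every assignment.
        if self.start_time < 0:
            raise ValueError("start_time must be non-negative")
        if self.end_time < self.start_time:
            raise ValueError("end_time must not precede start_time")


@dataclass
class TextEntry:
    """Hypothetical text item that may carry a track source."""

    text: str
    source: Optional[TrackSource] = None


def add_text(
    items: List[TextEntry], text: str, source: Optional[TrackSource] = None
) -> TextEntry:
    """Append a text entry; `source` is optional so existing callers keep working."""
    entry = TextEntry(text=text, source=source)
    items.append(entry)
    return entry
```

Keeping `source` optional with a `None` default means callers that do not deal with media tracks are unaffected, which is the main point of the last bullet.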
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
…le types Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Since WebVTTTimestamp is used in DoclingDocument, the class should be public. Strengthen validation of cue language start tag annotation. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
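As a rough illustration of the kind of validation a timestamp type has to perform, the sketch below parses the WebVTT timestamp syntax (`mm:ss.ttt` or `hh:mm:ss.ttt`, with two or more hour digits) into seconds. It is a standalone example, not the `WebVTTTimestamp` class from this PR; `parse_timestamp` is a hypothetical helper name.

```python
import re

# "mm:ss.ttt" or "hh:mm:ss.ttt"; hours take two or more digits,
# minutes and seconds are limited to 00-59, milliseconds are three digits.
_TIMESTAMP_RE = re.compile(
    r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]{3})"
)


def parse_timestamp(raw: str) -> float:
    """Return the timestamp in seconds, or raise ValueError if it is malformed."""
    match = _TIMESTAMP_RE.fullmatch(raw)
    if match is None:
        raise ValueError(f"invalid WebVTT timestamp: {raw!r}")
    hours, minutes, seconds, millis = match.groups()
    return (
        int(hours or 0) * 3600
        + int(minutes) * 60
        + int(seconds)
        + int(millis) / 1000
    )
```

Using `fullmatch` rather than `match` rejects trailing characters, so a string with extra content after the milliseconds is reported as malformed instead of being silently truncated.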
Add a DoclingDocument serializer to WebVTT format. Improve WebVTT data model. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Add 'text/vtt' as an extra MIME type to support WebVTT serialization, since 'mimetypes' does not support it with Python < 3.11. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
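For readers unfamiliar with the workaround described in this commit, the sketch below shows the general idea using only the standard library: register the type explicitly so the lookup does not depend on the interpreter's built-in table. How the PR itself wires this in is not shown here; `guess_mime` is a hypothetical helper.

```python
import mimetypes

# Register the WebVTT MIME type explicitly; this is harmless if the
# running interpreter already knows the mapping.
mimetypes.add_type("text/vtt", ".vtt")


def guess_mime(filename: str) -> str:
    """Return the MIME type for a filename, with a generic fallback."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"
```

Calling `guess_mime("captions.vtt")` then yields `text/vtt` regardless of the Python version in use.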
Classes and fields that are related to the new source type should align with their names. The term 'provenance' will identify the legacy implementation. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Drop the validation on field assignment in NodeItem objects. Add the 'source' argument to the convenience function 'add_text' to create a TextItem with track source data. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
refactor(webvtt): drop cue span classes, 'lang' and 'c' tags
Drop WebVTT formatting features not covered by Docling across formats. Only 'u', 'b', 'i', and 'v' are supported, and without classes. Make the 'v' tag explicit as the 'voice' feature in the SourceTrack class. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
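The effect of this refactor on a cue payload can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's parser: `simplify_cue` is a hypothetical helper that keeps 'b', 'i', and 'u' tags without classes, lifts the first 'v' annotation out as the voice, and drops every other named tag while keeping the enclosed text.

```python
import re
from typing import Optional, Tuple

KEPT_TAGS = {"b", "i", "u"}  # formatting tags preserved, without classes

# Groups: closing slash, tag name, optional ".class" chain, optional annotation.
_TAG_RE = re.compile(r"<(/?)([a-z]+)((?:\.[^\s.>]+)*)(?:[ \t]+([^>]*))?>")


def simplify_cue(payload: str) -> Tuple[str, Optional[str]]:
    """Return (simplified payload, voice), where voice may be None."""
    voice: Optional[str] = None

    def _rewrite(match: "re.Match[str]") -> str:
        nonlocal voice
        closing, name, _classes, annotation = match.groups()
        if name == "v":
            # Record the first speaker only (a simplification); drop the tag.
            if not closing and annotation and voice is None:
                voice = annotation.strip()
            return ""
        if name in KEPT_TAGS:
            return f"<{closing}{name}>"  # same tag, classes removed
        return ""  # unsupported tag such as 'c' or 'lang': keep only its text

    return _TAG_RE.sub(_rewrite, payload), voice
```

For example, `simplify_cue("<v.loud Alice><b.big>Hi</b> <c.red>there</c></v>")` returns `("<b>Hi</b> there", "Alice")`: the class and color information is lost, while the text, the bold span, and the speaker survive.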
Thanks @vagenas for the review.
OK ✅
may now imply loss of information, even though the resulting file will still be a valid WebVTT according to the specs.
OK ✅
This PR introduces a new type of provenance object for media files.
- `ProvenanceTrack` for media tracks
- `WebVTT` for reuse
`docling`: check the draft PR feat: webvtt and source tracker docling#2787
It also addresses docling-project/docling#2525